ABSTRACT
Component models such as factor analysis can be used to analyse spatial distributions of a large number of different features – for instance the isogloss data in a dialect atlas, or the distributions of ethnological or archaeological phenomena – with the goal of finding dialects or similar cultural aggregates. However, there are several such methods, and it is not obvious how their differences affect their usability for computational dialectology. We address this question by comparing five such methods on two different dialectological data sets. The methods differ in some fundamental respects, and some of these differences affect the dialectological interpretation of the results.
INTRODUCTION
Languages are traditionally subdivided into geographically distinct dialects, although any such division is only a coarse approximation of a more fine-grained variation. This underlying variation is usually visualised in the form of maps, where the distribution of various features is shown as isoglosses. It is possible to view dialectal regions, which this paper also simply calls dialects, as combinations of the distribution areas of these features, where the features have been weighted in such a way that the differences between the resulting dialects are as sharp as possible. Ideally, dialect borders are drawn where several isoglosses coincide.
As more and more dialectological data becomes available in electronic form, it is becoming increasingly attractive to apply computational methods to the problem of identifying dialects. One way to do this is to use clustering methods (e.g. Kaufman and Rousseeuw, 1990), particularly since such methods have already been used in dialectometric studies (e.g. Heeringa and Nerbonne, 2002; Moisl and Jones, 2005).